Abstract

Entity detection and tracking is a relatively new addition to the repertoire
of natural language tasks. In this paper, we present a statistical
language-independent framework for identifying and tracking named, nominal
and pronominal references to entities within unrestricted text documents,
and chaining them into clusters corresponding to each logical entity present
in the text. Both the mention detection model and the novel entity tracking
model can use arbitrary feature types, being able to integrate a wide array
of lexical, syntactic and semantic features. In addition, the mention
detection model crucially uses feature streams derived from different named
entity classifiers.

The proposed framework is evaluated with several experiments run on Arabic,
Chinese and English texts; a system based on the approach described here and
submitted to the latest Automatic Content Extraction (ACE) evaluation
achieved top-tier results in all three evaluation languages.

1  Introduction 

Detecting entities, whether named, nominal or pronominal, in unrestricted
text is a crucial step toward understanding the text, as it identifies the
important conceptual objects in a discourse. It is also a necessary step for
identifying the relations present in the text and populating a knowledge
database. This task has applications in information extraction and
summarization, information retrieval (one can get all hits for
Washington/person and not the ones for Washington/state or Washington/city),
data mining and question answering.

The Entity Detection and Tracking task (EDT henceforth) has close ties to
the named entity recognition (NER) and coreference resolution tasks, which
have been the focus of much investigation in the recent past (Bikel et al.,
1997; Borthwick et al., 1998; Mikheev et al., 1999; Miller et al., 1998;
Aberdeen et al., 1995; Ng and Cardie, 2002; Soon et al., 2001), and have
been at the center of several evaluations: MUC-6, MUC-7, and the CoNLL'02
and CoNLL'03 shared tasks. Usually, in the computational linguistics
literature, a named entity represents an instance of a name, either a
location, a person or an organization, and the NER task consists of
identifying each individual occurrence of such an entity. We will instead
adopt the nomenclature of the Automatic Content Extraction program^1 (NIST,
2003a): we will call the instances of textual references to objects or
abstractions mentions, which can be either named (e.g. John Mayor), nominal
(e.g. the president) or pronominal (e.g. she, it). An entity consists of all
the mentions (of any level) which refer to one conceptual entity. For
instance, in the sentence
President John Smith said he has no comments,
there are two mentions: John Smith and he (in the order of appearance, their
levels are named and pronominal), but one entity, formed by the set
{John Smith, he}.

In this paper, we present a general statistical framework for entity
detection and tracking in unrestricted text. The framework is not language
specific, as proved by applying it to three radically different languages:
Arabic, Chinese and English. We separate the EDT task into a mention
detection part - the task of finding all mentions in the text - and an
entity tracking part - the task of combining the detected mentions into
groups of references to the same object.

The work presented here is motivated by the ACE evaluation framework, which
has the more general goal of building multilingual systems which detect not
only entities, but also relations among them and, more recently, events in
which they participate. The EDT task is arguably harder than traditional
named entity recognition, because of the additional complexity involved in
extracting non-named mentions (nominals and pronouns) and the requirement of
grouping mentions into entities.

We present and empirically evaluate statistical models for both the mention
detection and entity tracking problems. For mention detection we use
approaches based on Maximum Entropy (MaxEnt henceforth) (Berger et al.,
1996) and Robust Risk Minimization (RRM henceforth) (Zhang et al., 2002).
The task is transformed into a sequence classification problem. We
investigate a wide array of lexical, syntactic and semantic features to
perform the mention detection and classification task including, for all
three languages, features based on pre-existing statistical semantic
taggers, even though these taggers have been trained on different corpora
and use different semantic categories. Moreover, the presented approach
implicitly learns the correlation between these different semantic types and
the desired output types.

^1 For a description of the ACE program see
http://www.nist.gov/speech/tests/ace/.

We propose a novel MaxEnt-based model for predicting whether a mention
should or should not be linked to an existing entity, and show how this
model can be used to build entity chains. The effectiveness of the approach
is tested by applying it to data from the above-mentioned languages: Arabic,
Chinese and English.

The framework presented in this paper is language-universal - the
classification method does not make any assumption about the type of input.
Most of the feature types are shared across the languages, but there are a
small number of useful feature types which are language-specific, especially
for the mention detection task.

The paper is organized as follows: Section 2 describes the algorithms and
feature types used for mention detection. Section 3 presents our approach to
entity tracking. Section 4 describes the experimental framework and the
systems' results for Arabic, Chinese and English on the data from the latest
ACE evaluation (September 2003), an investigation of the effect of using
different feature types, as well as a discussion of the results.

2  Mention  Detection 

The mention detection system identifies the named, nominal and pronominal
mentions introduced in the previous section. Similarly to classical NLP
tasks such as base noun phrase chunking (Ramshaw and Marcus, 1994), text
chunking (Ramshaw and Marcus, 1995) or named entity recognition (Tjong Kim
Sang, 2002), we formulate the mention detection problem as a classification
problem, by assigning to each token in the text a label, indicating whether
it starts a specific mention, is inside a specific mention, or is outside
any mentions.
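The token-labeling formulation can be illustrated with a minimal Python sketch. The label strings (B-, I-, O prefixes with NAM and PRO levels) and the helper name spans_to_labels are assumptions made for illustration only; they are not the tag inventory used by the system described here.

```python
from typing import List, Tuple

def spans_to_labels(tokens: List[str],
                    spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn mention spans (start, end_exclusive, type) into one label per token."""
    labels = ["O"] * len(tokens)          # default: outside any mention
    for start, end, mention_type in spans:
        labels[start] = "B-" + mention_type      # token that starts the mention
        for i in range(start + 1, end):
            labels[i] = "I-" + mention_type      # tokens inside the mention
    return labels

# The example sentence from Section 1, with its two mentions.
tokens = ["President", "John", "Smith", "said", "he", "has", "no", "comments"]
spans = [(1, 3, "NAM"), (4, 5, "PRO")]
labels = spans_to_labels(tokens, spans)
```

Each token now carries exactly one label, so a per-token classifier can be trained and decoded over the sequence.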

2.1  The  Statistical  Classifiers 

Good performance in many natural language processing tasks, such as
part-of-speech tagging, shallow parsing and named entity recognition, has
been shown to depend heavily on integrating many sources of information
(Zhang et al., 2002; Jing et al., 2003; Ittycheriah et al., 2003). Given the
stated focus of integrating many feature types, we are interested in
algorithms that can easily integrate and make effective use of diverse input
types. We selected two methods which satisfy these criteria: a linear
classifier - the Robust Risk Minimization classifier - and a log-linear
classifier - the Maximum Entropy classifier. Both methods can integrate
arbitrary types of information and make a classification decision by
aggregating all information available for a given classification.


Before formally describing the methods^2, we introduce some notation: let
$C = \{c_1, \ldots, c_n\}$ be the set of predicted classes, $\mathcal{X}$ be
the example space and $\mathcal{F} = \{0, 1\}^m$ be the feature space. Each
example $x \in \mathcal{X}$ has an associated vector of binary features
$f(x) = (f_1(x), \ldots, f_m(x))$. We also assume the existence of a
training data set $\mathcal{T} \subset \mathcal{X}$ and a test set
$\mathcal{E} \subset \mathcal{X}$.

The RRM algorithm (Zhang et al., 2002) constructs $n$ linear classifiers
$\{C_i\}_{i=1 \ldots n}$ (one for each predicted class), each predicting
whether the current example belongs to the class or not. Every such
classifier $C_i$ has an associated feature weight vector,
$(w_{ij})_{j=1 \ldots m}$, which is learned during the training phase so as
to minimize the classification error rate^3. At test time, for each example
$x \in \mathcal{E}$, the model computes a score

$$a_i(x) = \sum_{j=1}^{m} w_{ij} \cdot f_j(x)$$

and labels the example with either the class corresponding to the classifier
with the highest score, if that score is above 0, or outside otherwise. The
full decoding algorithm is presented in Algorithm 1. This algorithm can also
be used for sequence classification (Williams and Peng, 1990), by converting
the activation scores into probabilities (through the soft-max function, for
instance) and using the standard dynamic programming search algorithm (also
known as Viterbi search).


Algorithm 1 The RRM Decoding Algorithm
foreach $x \in \mathcal{E}$
    foreach $i = 1 \ldots n$
        $a_i[x] = \sum_j w_{ij} \cdot f_j(x)$
    $class(x) \leftarrow \arg\max_i a_i[x]$
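As a rough illustration of the decoding rule described above (highest-scoring class if its score is above 0, outside otherwise), here is a minimal Python sketch. The function name rrm_decode, the toy weights and the class names are hypothetical; training of the weights, including the regularization used by RRM, is not shown.

```python
from typing import Sequence

def rrm_decode(weights: Sequence[Sequence[float]],
               features: Sequence[int],
               classes: Sequence[str],
               outside: str = "O") -> str:
    """Score each class with a linear classifier and pick the best one."""
    # a_i(x) = sum_j w_ij * f_j(x), one score per class i
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Fall back to the outside label when no classifier fires positively.
    return classes[best] if scores[best] > 0 else outside

# Toy example: two classes, three binary features (all values made up).
toy_weights = [[1.0, -0.5, 0.0],
               [-1.0, 2.0, 0.5]]
toy_classes = ["PERSON", "ORGANIZATION"]
decoded_active = rrm_decode(toy_weights, [1, 1, 0], toy_classes)
decoded_empty = rrm_decode(toy_weights, [0, 0, 0], toy_classes)
```

With features [1, 1, 0] the scores are 0.5 and 1.0, so the second class wins; with no active features all scores are 0 and the outside label is returned.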


Somewhat similarly, the MaxEnt algorithm has an associated set of weights
$(\alpha_{ij})_{i=1 \ldots n,\, j=1 \ldots m}$, which are estimated during
the training phase so as to maximize the likelihood of the data (Berger et
al., 1996). Given these weights, the model computes the probability
distribution of a particular example $x$ as follows:

$$P(c_i \mid x) = \frac{1}{Z} \prod_{j=1}^{m} \alpha_{ij}^{f_j(x)}, \qquad Z = \sum_{i} \prod_{j} \alpha_{ij}^{f_j(x)}$$

where $Z$ is a normalization factor.

After computing the class probability distribution, the assigned class is
the most probable one a posteriori. The sketch of applying MaxEnt to the
test data is presented in Algorithm 2. Similarly to the RRM model, we use
the model to perform sequence classification, through dynamic programming.

^2 This is not meant to be an in-depth introduction to the methods, but a
brief overview to familiarize the reader with them.

^3 Actually, the optimizing function contains a regularization factor which
considerably improves the robustness of the system; for full details, see
Zhang et al. (2002).


Algorithm 2 The MaxEnt Decoding Algorithm
foreach $x \in \mathcal{E}$
    $Z \leftarrow 0$
    foreach $i = 1 \ldots n$
        $p_i[x] = \prod_{j=1}^{m} \alpha_{ij}^{f_j(x)}$
    Normalize($p$)
    $class(x) \leftarrow \arg\max_i p_i[x]$
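The MaxEnt decoding step can be sketched in a few lines of Python, assuming binary features and positive multiplicative weights as in the formula above. The function name maxent_posterior and the toy weight values are illustrative assumptions; parameter estimation is not shown.

```python
import math
from typing import List, Sequence

def maxent_posterior(alpha: Sequence[Sequence[float]],
                     features: Sequence[int]) -> List[float]:
    """Return P(c_i | x) for every class i under a log-linear model."""
    # Unnormalized score: product over j of alpha_ij ** f_j(x)
    unnormalized = [math.prod(a ** f for a, f in zip(row, features))
                    for row in alpha]
    z = sum(unnormalized)                 # normalization factor Z
    return [u / z for u in unnormalized]

# Toy example: two classes, two binary features (values made up).
toy_alpha = [[2.0, 1.0],
             [1.0, 3.0]]
posterior = maxent_posterior(toy_alpha, [1, 1])
best_class = max(range(len(posterior)), key=posterior.__getitem__)
```

With both features active the unnormalized scores are 2 and 3, giving posteriors 0.4 and 0.6, so the second class is the most probable a posteriori.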


Within this framework, any type of feature can be used, enabling the system
designer to experiment with interesting feature types, rather than worry
about specific feature interactions. In contrast, in a rule-based system,
the system designer would have to consider how, for instance,
WordNet-derived (Miller, 1995) information for a particular example
interacts with part-of-speech-based information and chunking information.
That is not to say, ultimately, that rule-based systems are in some way
inferior to statistical models - they are built using valuable insight which
is hard to obtain from a statistical-model-only approach. Instead, we are
just suggesting that the output of such a system can be easily integrated
into the previously described framework, as one of the input features, most
likely leading to improved performance.

2.2 The Combination Hypothesis

In addition to using rich lexical, syntactic, and semantic features, we
leveraged several pre-existing mention taggers. These pre-existing taggers
were trained on datasets outside of the ACE training data and they identify
types of mentions different from the ACE types of mentions. For instance, a
pre-existing tagger may identify dates or occupation mentions (not used in
ACE), among other types. It could also have a class called PERSON, but the
annotation guideline of what represents a PERSON may not match exactly the
notion of the PERSON type in ACE. Our hypothesis - the combination
hypothesis - is that combining pre-existing classifiers from diverse sources
will boost performance by injecting complementary information into the
mention detection models. Hence, we used the output of these pre-existing
taggers as additional feature streams for the mention detection models. This
approach allows the system to automatically correlate the (different)
mention types to the desired output.

2.3  Language-Independent  Features 

Although the three languages (Arabic, Chinese and English) are radically
different syntactically, semantically, and even graphically, all models use
a few universal types of features, while others are language-specific. Let
us note again that, while some types of features only apply to one language,
the models have the same basic structure, treating the problem as an
abstract classification task.

The following is a list of the features that are shared across languages
($w_i$ is considered by default the current token):


•  tokens^4 in a window of 5: $\{w_{i-2}, \ldots, w_{i+2}\}$;

•  the part-of-speech associated with token $w_i$;

•  dictionary information (whether the current token is part of a large
collection of dictionaries - one boolean value for each dictionary);

•  the output of named mention detectors trained on different styles of
entities;

•  the previously assigned classification tags^5.
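A minimal sketch of how the token-window and tag-history features from the list above might be generated for one position is given below. The feature naming scheme, the padding symbol and the function window_features are assumptions for illustration; the part-of-speech, dictionary and external-tagger streams would be added in the same way.

```python
from typing import Dict, List

def window_features(tokens: List[str], i: int, prev_tags: List[str],
                    half_width: int = 2, pad: str = "<PAD>") -> Dict[str, str]:
    """Token-window and tag-history features for the token at position i."""
    feats: Dict[str, str] = {}
    # Tokens w_{i-2} .. w_{i+2}; positions outside the text get a pad symbol.
    for offset in range(-half_width, half_width + 1):
        j = i + offset
        feats[f"w[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else pad
    # History of (at most) the two most recently assigned tags, nearest first.
    for back, tag in enumerate(reversed(prev_tags[-2:]), start=1):
        feats[f"tag[-{back}]"] = tag
    return feats

tokens = ["John", "Smith", "said", "he", "has", "no", "comments"]
feats = window_features(tokens, 1, ["B-NAM"])
```

Because each feature is just a named string value, further streams can be merged into the same dictionary without changing the classifier.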

The following sections describe in detail the language-specific features,
and Table 1 summarizes the feature types used in building the models in the
three languages. Finally, the experiments in Section 4 detail the
performance obtained by using selected combinations of feature subsets.

2.4  Arabic  Mention  Detection 

Arabic, a highly inflected language, has linguistic peculiarities that
affect any mention detection system. An important aspect that needs to be
addressed is segmentation: which style should be used, how to deal with the
inherent segmentation ambiguity of mention names, especially persons and
locations, and, finally, how to handle the attachment of pronouns to stems.
Arabic blank-delimited words are composed of zero or more prefixes, followed
by a stem and zero or more suffixes. Each prefix, stem or suffix will be
called a token in this discussion; any contiguous sequence of tokens can
represent a mention.

For example, the word "trwmAn" (translation: "Truman") could be segmented
into 3 tokens (for instance, if the word was not seen in the training data):

trwmAn => t + rwm + An

which introduces ambiguity, as the three tokens really form just one
mention, and, in the case of the word "tmnEh", which has the segmentation

tmnEh => t + mnE + h

the first and third tokens should both be labeled as pronominal mentions -
but, to do this, they need to be separated from the stem mnE.

Pragmatically,  we  found  segmenting  Arabic  text  to  be  a 
necessary  and  beneficial  process  due  mainly  to  two  facts: 

1.  some prefixes/suffixes can receive a different mention type than the
stem they are glued to (for instance, in the case of pronouns);

2.  keeping words together results in significant data sparseness, because
of the inflected nature of the language.

^4 Each language may have a different notion of what represents a token.

^5 In the current implementation, the models use a history of 2 tags.


Feature Type                  Ar     Zh     En
Token in window of 5          ✓      ✓      ✓
Morph in window of 5          ✓      N/A    ✓
POS info                      ✓      ✓      ✓
Text chunking info            -      -      ✓
Capitalization/word-type      N/A    N/A    ✓
Prefixes/suffixes             ✓      N/A    ✓
Gazetteer info                ✓      ✓      ✓
Gap                           -      -      ✓
Wordnet info                  -      -      ✓
Segmentation                  ✓      ✓      N/A
Additional systems' output    ✓      ✓      ✓

Table 1: Summary of features used by the 3 systems

Given  these  observations,  we  decided  to  “condition”  the 
output  of  the  system  on  the  segmented  data:  the  text  is 
first  segmented  into  tokens,  and  the  classification  is  then 
performed  on  tokens.  The  segmentation  model  is  similar 
to  the  one  presented  by  Lee  et  al.  (2003),  and  obtains  an 
accuracy  of  about  98%. 

In addition, special attention is paid to prefixes and suffixes: in order to
reduce the number of spurious tokens we re-merge the prefixes or suffixes to
their corresponding stem if they are not essential to the classification
process. For this purpose, we collect the following statistics for each
prefix/suffix ps from the ACE training data: the frequency of ps occurring
as a mention by itself (M) and the frequency of ps occurring as a part of a
mention (P). If the ratio $M/P$ is below a threshold (estimated on the
development data), ps is re-merged with its corresponding stem. Only a few
prefixes and suffixes were merged using these criteria. This is appropriate
for the ACE task, since a large percentage of prefixes and suffixes are
annotated as pronoun mentions^6.
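The re-merging decision can be sketched as a simple ratio test. This sketch assumes the ratio compares M (occurrences as a mention by itself) against P (occurrences as part of a mention), so that affixes which are rarely mentions on their own get merged back; the function name should_remerge, the handling of zero counts and the example threshold are hypothetical choices, not taken from the system.

```python
def should_remerge(m_count: int, p_count: int, threshold: float) -> bool:
    """Decide whether a prefix/suffix should be glued back onto its stem.

    m_count: times the affix was annotated as a mention by itself (M).
    p_count: times the affix occurred as part of a larger mention (P).
    """
    if m_count == 0:
        return True            # never a mention on its own: safe to merge
    if p_count == 0:
        return False           # only ever a standalone mention: keep separate
    return m_count / p_count < threshold

# Hypothetical counts; the threshold would be tuned on development data.
merge_rare = should_remerge(1, 100, threshold=0.1)
merge_pronoun_like = should_remerge(50, 10, threshold=0.1)
```

An affix with M = 1 and P = 100 is merged back, while a pronoun-like affix with M = 50 and P = 10 stays a separate token.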

In  addition  to  the  language-general  features  described 
in  Section  2.3,  the  Arabic  system  implements  a  feature 
specifying  for  each  token  its  original  stem. 

For this system, the gazetteer features are computed on words, not on
tokens; the gazetteers consist of 12000 person names and 3000 location and
country names, all of which have been collected in a few man-hours of web
browsing. The system also uses features based on the output of three
additional mention detection classifiers: an RRM model predicting 48 mention
categories, and an RRM model and an HMM model predicting 32 mention
categories.

^6 For some additional data, annotated with 32 named categories, mentioned
later on, we use the same approach of collecting the M and P statistics,
but, since named mentions are predominant and there are no pronominal
mentions in that case, most suffixes and some prefixes are merged back to
their original stem.

2.5 Chinese Mention Detection

In Chinese text, unlike in Indo-European languages, words neither are
white-space delimited nor do they have capitalization markers. Instead of a
word-based model, we build a character-based one, since word segmentation
errors can lead to irrecoverable mention detection errors; Jing et al.
(2003) also observe that character-based models perform better than
word-based ones for Chinese named entity recognition. Although the model is
character-based, segmentation information is still useful and is integrated
as an additional feature stream.

The following additional resources are used in building the system:

•  Gazetteers include dictionaries of 10k person names, 8k location and
country names, and 3k organization names, compiled from annotated corpora.

•  There are four additional classifiers whose output is used as features:
an RRM model which outputs 32 named categories, an RRM model identifying 49
categories, an RRM model identifying 45 mention categories, and an RRM model
that classifies whether a character is an English character, a numeral or
other.

2.6  English  Mention  Detection 

The English mention detection model is similar to the system described in
(Ittycheriah et al., 2003)^7. The following is a list of additional features
(again, $w_i$ is the current token):

•  Shallow parsing information associated with the tokens in a window of 3;

•  Prefixes/suffixes of length up to 4;

•  A capitalization/word-type flag (similar to the ones described by Bikel
et al. (1997));

•  Gazetteer information: a handful of dictionaries of locations (55k
entries), person names (30k) and organizations (5k);

•  A combination of gazetteer, POS and capitalization information, obtained
as follows: if the word is a closed-class word, select its class; else if it
is in a dictionary, select that class; otherwise back off to its
capitalization information; we call this feature gap;

•  WordNet information (the synsets and hypernyms of the two most frequent
senses of the word);

•  The outputs of three systems (HMM, RRM and MaxEnt) trained on
32-category named entity data, the output of an RRM system trained on the
MUC-6 data, and the output of an RRM model identifying 49 categories.
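The back-off logic of the gap feature in the list above can be sketched as follows. The dictionaries, class names and capitalization labels here are invented for illustration; the actual closed-class inventory, gazetteers and word-type flags are not specified in this section.

```python
from typing import Dict

def gap_feature(word: str,
                closed_class: Dict[str, str],
                gazetteer: Dict[str, str]) -> str:
    """Closed-class label, else gazetteer class, else capitalization pattern."""
    lowered = word.lower()
    if lowered in closed_class:           # 1. closed-class word: use its class
        return closed_class[lowered]
    if word in gazetteer:                 # 2. dictionary hit: use that class
        return gazetteer[word]
    if word.isupper():                    # 3. back off to capitalization
        return "ALLCAPS"
    if word[:1].isupper():
        return "INITCAP"
    return "LOWER"

# Hypothetical resources.
toy_closed_class = {"the": "DET", "he": "PRON"}
toy_gazetteer = {"Washington": "LOCATION"}
gap_values = [gap_feature(w, toy_closed_class, toy_gazetteer)
              for w in ["The", "Washington", "Smith", "IBM", "comments"]]
```

The ordering matters: a capitalized sentence-initial "The" is resolved by the closed-class lookup before capitalization is ever consulted.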

^7 The main difference between their system and ours is that they build a
MaxEnt model capable of building hierarchical structures - therefore
treating the problem as a parsing task - while our system treats the problem
as a classification task.

3 Entity Tracking

This section introduces a novel statistical approach to entity tracking. We
choose to model the process of forming entities from mentions, one step at a
time. The process works from left to right: it starts with an initial entity
consisting of the first mention of a document, and the next mention is
processed by either linking it with one of the existing entities, or
starting a new entity. The process could have as output any one of the
possible partitions of the mention set.^8 Two separate models are used to
score the linking and starting actions, respectively.
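To get a feel for how quickly the number of possible partitions of a mention set grows, the Bell numbers can be computed with the standard Bell triangle recurrence. This is a small illustrative Python sketch; the function name bell_number is our own.

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n elements (Bell triangle)."""
    row = [1]                              # row 0 of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]                # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]                          # first entry of row n is B_n

partitions_20 = bell_number(20)
```

For a 20-mention document this gives 51,724,158,235,372 partitions, which is why exhaustive search over entity outcomes is not practical.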

3.1  Tracking  Algorithm 

Formally, let $\{m_i : 1 \le i \le n\}$ be $n$ mentions in a document. Let
$g : i \mapsto j$ be the map from mention index $i$ to entity index $j$. For
a mention index $k$ ($1 \le k \le n$), let us define

$$J_k = \{t : t = g(i) \text{ for some } 1 \le i \le k-1\},$$

the set of indices of the partially-established entities to the left of
$m_k$ (note that $J_1 = \emptyset$), and

$$E_k = \{e_t : t \in J_k\},$$

the set of the partially-established entities.

Given that $E_k$ has been formed to the left of the active mention $m_k$,
$m_k$ can take two possible actions: if $g(k) \in J_k$, then the active
mention $m_k$ is said to link with the entity $e_{g(k)}$; otherwise it
starts a new entity $e_{g(k)}$. At training time, the action is known to us,
and at testing time, both hypotheses will be kept during search. Notice that
a sequence of such actions corresponds uniquely to an entity outcome (or a
partition of mentions). Therefore, the problem of coreference resolution is
equivalent to ranking the action sequences.

In this work, a binary model $P(L = 1 \mid E_k, m_k, A = t)$ is used to
compute the link probability, where $t \in J_k$, $L$ is 1 iff $m_k$ links
with $e_t$; the random variable $A$ is the index of the partial entity to
which $m_k$ is linking. Since starting a new entity means that $m_k$ does
not link with any entities in $E_k$, the probability of starting a new
entity, $P(L = 0 \mid E_k, m_k)$, can be computed as

$$P(L = 0 \mid E_k, m_k) = \sum_{t \in J_k} P(L = 0, A = t \mid E_k, m_k)
= 1 - \sum_{t \in J_k} P(A = t \mid E_k, m_k) \, P(L = 1 \mid E_k, m_k, A = t) \quad (1)$$

Therefore, the probability of starting an entity can be computed using the
linking probabilities $P(L = 1 \mid E_k, m_k, A = t)$, provided that the
marginal $P(A = t \mid E_k, m_k)$ is known. While other models are possible,
in the results reported in this paper, $P(A = t \mid E_k, m_k)$ is
approximated as:

$$P(A = t \mid E_k, m_k) = \begin{cases} 1 & \text{if } t = \arg\max_{i \in J_k} P(L = 1 \mid E_k, m_k, A = i) \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

^8 The number of all possible partitions of a set is given by the Bell
number (Bell, 1934). This number is very large even for a document with a
moderate number of mentions: about 51.7 trillion for a 20-mention document.
For practical reasons, the search space has to be reduced to a reasonably
small set of hypotheses.


That  is,  the  starting  probability  is  just  one  minus  the  max¬ 
imum  linking  probability. 
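In code, this approximation reduces to a one-line computation. The sketch below is illustrative only; the function name start_probability is our own, and the case of no existing entities (the first mention) is treated as a certain start.

```python
from typing import Sequence

def start_probability(link_probs: Sequence[float]) -> float:
    """P(L = 0 | E_k, m_k) under approximation (2): one minus the best link."""
    # With no partial entities to link to, the mention must start a new entity.
    return 1.0 - max(link_probs, default=0.0)

# Hypothetical link probabilities of one mention to three partial entities.
p_start = start_probability([0.2, 0.7, 0.4])
```

Here the best link has probability 0.7, so the mention starts a new entity with probability 0.3.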

Training the model $P(L = 1 \mid E_k, m_k, A = i)$ directly is difficult
since it depends on all partial entities $E_k$. As a first attempt at
modeling the process from mentions to entities, we make the following
modeling assumptions:

$$P(L = 1 \mid E_k, m_k, A = i) \approx P(L = 1 \mid e_i, m_k) \quad (3)$$
$$\approx \max_{m \in e_i} P(L = 1 \mid m, m_k). \quad (4)$$

Once the linking probability $P(L = 1 \mid E_k, m_k, A = i)$ is available,
the starting probability $P(L = 0 \mid E_k, m_k)$ can be computed using (1)
and (2). The strategy used to find the best set of entities is shown in
Algorithm 3.


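Algorithm 3 Coreference Decoding Algorithm
Input: mentions in text $M = \{m_i : i = 1, \ldots, n\}$
Output: a partition $E$ of the set $M$
$\mathcal{H} \leftarrow \{E_0 = \{\{m_1\}\}\}$; $scr(E_0) = 1$
foreach $k = 2, \ldots, n$
    $\mathcal{H}' \leftarrow \emptyset$
    foreach $E \in \mathcal{H}$
        $E' \leftarrow E \cup \{\{m_k\}\}$
        $scr(E') \leftarrow scr(E) \cdot P(L = 0 \mid E, m_k)$
        $\mathcal{H}' \leftarrow \mathcal{H}' \cup \{E'\}$
        foreach $i \in J_k$
            $E' \leftarrow (E \setminus \{e_i\}) \cup \{e_i \cup \{m_k\}\}$
            $scr(E') \leftarrow scr(E) \cdot P(L = 1 \mid E, m_k, A = i)$
            $\mathcal{H}' \leftarrow \mathcal{H}' \cup \{E'\}$
    $\mathcal{H} \leftarrow prune(\mathcal{H}')$
return $\arg\max_{E \in \mathcal{H}} scr(E)$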
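The decoding loop can be sketched in Python as a beam search over partial partitions. Several things here are assumptions made for illustration: the pruning step is taken to keep the top-scoring hypotheses (the pruning criterion is not specified above), the entity-level link probability uses approximation (4), and toy_pair_prob stands in for the trained mention-pair model.

```python
from typing import Callable, List, Sequence, Tuple

Entity = Tuple[str, ...]
Partition = Tuple[Entity, ...]

def entity_link_prob(entity: Entity, mention: str,
                     pair_prob: Callable[[str, str], float]) -> float:
    """Approximation (4): best mention-pair probability within the entity."""
    return max(pair_prob(m, mention) for m in entity)

def track_entities(mentions: Sequence[str],
                   pair_prob: Callable[[str, str], float],
                   beam_size: int = 5) -> Partition:
    """Left-to-right search over link/start actions, keeping beam_size hypotheses."""
    if not mentions:
        return ()
    hypotheses: List[Tuple[float, Partition]] = [(1.0, ((mentions[0],),))]
    for mention in mentions[1:]:
        expanded: List[Tuple[float, Partition]] = []
        for score, partition in hypotheses:
            links = [entity_link_prob(e, mention, pair_prob) for e in partition]
            # Start action: probability is one minus the best link probability.
            expanded.append((score * (1.0 - max(links)), partition + ((mention,),)))
            # Link actions: attach the mention to each existing entity in turn.
            for i, p_link in enumerate(links):
                merged = partition[:i] + (partition[i] + (mention,),) + partition[i + 1:]
                expanded.append((score * p_link, merged))
        expanded.sort(key=lambda h: h[0], reverse=True)
        hypotheses = expanded[:beam_size]          # assumed pruning: top-k by score
    return max(hypotheses, key=lambda h: h[0])[1]

def toy_pair_prob(a: str, b: str) -> float:
    """Hypothetical stand-in for the MaxEnt mention-pair model."""
    return 0.8 if {a, b} == {"John Smith", "he"} else 0.1

best_partition = track_entities(["John Smith", "he", "IBM"], toy_pair_prob)
```

On this toy input the best-scoring hypothesis links "he" to "John Smith" (0.8) and then starts a new entity for "IBM" (0.9), for a total score of 0.72.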


3.2  Entity  Tracking  Features 

A  maximum  entropy  model  is  used  to  implement  (4). 

Atomic  features  used  by  the  model  include: 

•  string match - whether the mention strings of $m$ and $m_k$ match
exactly or partially;

•  context - surrounding words or part-of-speech tags (if available) of
mentions $m$, $m_k$;

•  mention count - how many times a mention string appears in the document.
The count is quantized;

•  distance - distance between the two mentions in words and sentences. This
number is also quantized;

•  editing distance - quantized editing distance between the two mentions;

•  mention information - spellings of the two mentions and other information
(such as POS tags) if available; if a mention is a pronoun, the feature also
computes gender, plurality, possessiveness and reflexiveness;

•  acronym - whether or not one mention is the acronym of the other mention;

•  syntactic features - whether or not the two mentions appear in
apposition. This information is extracted from a parse tree, and can be
computed only when a parser is available.
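A few of the atomic features in the list above can be sketched directly on mention strings. The bucket boundaries, the acronym test and the function names below are illustrative assumptions; the quantization scheme and matching rules actually used are not given in this section.

```python
from typing import Dict, Optional

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def quantize(value: int, bounds=(0, 1, 2, 4, 8)) -> int:
    """Map a count to the index of the first bound it does not exceed."""
    for bucket, bound in enumerate(bounds):
        if value <= bound:
            return bucket
    return len(bounds)

def _initials(text: str) -> Optional[str]:
    words = text.split()
    return "".join(w[0] for w in words).upper() if len(words) > 1 else None

def is_acronym(a: str, b: str) -> bool:
    """True if one string equals the initials of the other (multi-word) string."""
    a_clean, b_clean = a.replace(".", "").upper(), b.replace(".", "").upper()
    return _initials(b) == a_clean or _initials(a) == b_clean

def pair_features(m: str, mk: str) -> Dict[str, object]:
    a, b = m.lower(), mk.lower()
    exact = a == b
    partial = not exact and bool(set(a.split()) & set(b.split()))
    return {"exact_match": exact, "partial_match": partial,
            "acronym": is_acronym(m, mk),
            "edit_bucket": quantize(edit_distance(a, b))}

features_name = pair_features("John Smith", "Smith")
features_org = pair_features("IBM", "International Business Machines")
```

Quantizing counts and distances into a handful of buckets keeps the features binary-friendly, which suits the MaxEnt model described earlier.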


Data Set            Arabic    Chinese    English
Train               65.6k     86.5k      340.7k
Development Test    7.7k      7.2k       71k
Sep'03 Eval Test    93.5k     108.2k     60.7k

Table 2: Data statistics (number of tokens) for Arabic, Chinese and English

Another category of features is created by taking conjunctions of the atomic
features. For example, the model can capture how far a pronoun mention is
from a named mention when the distance feature is used in conjunction with
the mention information feature.

As is the case with the mention detection approach presented in Section 2,
most features used here are language-independent and are instantiated from
the training data, while some are language-specific, but mostly because the
resources were not available for the specific language. For example,
syntactic features are not used in the Arabic system due to the lack of an
Arabic parser.

Simple as it seems, the mention-pair model has been shown to work well (Soon
et al., 2001; Ng and Cardie, 2002). As will be shown in Section 4, the
relatively knowledge-lean feature sets work fairly well in our tasks.

Although we also use a mention-pair model, our tracking algorithm differs
from Soon et al. (2001) and Ng and Cardie (2002) in several aspects. First,
the mention-pair model is used as an approximation to the entity-mention
model (3), which itself is an approximation of
$P(L = 1 \mid E_k, m_k, A = i)$. Second, instead of doing a pick-first (Soon
et al., 2001) or best-first (Ng and Cardie, 2002) selection, the
mention-pair linking model is used to compute a starting probability. The
starting probability enables us to score the action of creating a new entity
without thresholding the link probabilities. Third, this probabilistic
framework allows us to search the space of all possible entities, while Soon
et al. (2001) and Ng and Cardie (2002) take the "best" local hypothesis.

4  Experimental  Results 

The data used in all experiments presented in this section is provided by
the Linguistic Data Consortium and is distributed by NIST to all
participants in the ACE evaluation. In the comparative experiments for the
mention detection and entity tracking tasks, the training data for the
English system consists of the training data from both the 2002 evaluation
and the 2003 evaluation, while for Arabic and Chinese, new additions to the
ACE task in 2003, it consists of 80% of the provided training data. Table 2
shows the sizes of the training, development and evaluation test data for
the 3 languages. The data is annotated with five types of entities: person,
organization, geo-political entity, location and facility. Each mention can
be either named, nominal or pronominal, and can be either generic (not
referring to a clearly described entity) or specific.

The models for all three languages are built as joint models, simultaneously
predicting the type, level and genericity of a mention - basically each
mention is labeled with a 3-pronged tag. To transform the problem into a
classification task, we use the IOB2 classification scheme (Tjong Kim Sang
and Veenstra, 1999).

4.1  The  ACE  Value 

A gauge of the performance of an EDT system is the ACE
value, a measure developed especially for this purpose. It
estimates the normalized weighted cost of detection of
specific-only entities in terms of misses, false alarms and
substitution errors (entities marked generic are excluded
from computation): any undetected entity is considered
a miss, system-output entities with no corresponding reference
entities are considered false alarms, and entities
whose type was mis-assigned are substitution errors. The
ACE value computes a weighted cost by applying different
weights to each error, depending on the error type and
target entity type (e.g. PERSON-NAMEs are weighted
a lot more heavily than FACILITY-PRONOUNs) (NIST,
2003a). The cumulative cost is normalized by the cost
of a (hypothetical) system that outputs no entities at all,
which would receive an ACE value of 0. Finally, the
normalized cost is subtracted from 100.0 to obtain the
ACE value; a value of 100% corresponds to perfect entity
detection. A system can obtain a negative score if it
proposes too many incorrect entities.
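
The normalization described above can be sketched as follows. This is only a simplified illustration: the weights in CATEGORY_WEIGHT and ERROR_COST are hypothetical placeholders, not the official values from the NIST evaluation plan, and the function name ace_like_value is an assumption of this example.

```python
from typing import Dict, List, Tuple

# Hypothetical weights per entity category (placeholders, NOT the NIST values).
CATEGORY_WEIGHT: Dict[str, float] = {
    "PERSON-NAME": 1.0,
    "ORGANIZATION-NAME": 0.5,
    "FACILITY-PRONOUN": 0.05,
}
# Hypothetical relative cost of each error kind.
ERROR_COST: Dict[str, float] = {
    "miss": 1.0,
    "false_alarm": 0.75,
    "substitution": 0.5,
}


def ace_like_value(reference: List[str], errors: List[Tuple[str, str]]) -> float:
    """Return 100 minus the normalized weighted error cost.

    reference: categories of the specific (non-generic) reference entities.
    errors:    (error_kind, category) pairs produced by scoring a system.
    """
    cost = sum(ERROR_COST[kind] * CATEGORY_WEIGHT[cat] for kind, cat in errors)
    # Normalizer: a system that outputs nothing misses every reference entity.
    norm = sum(ERROR_COST["miss"] * CATEGORY_WEIGHT[cat] for cat in reference)
    return 100.0 * (1.0 - cost / norm)


reference = ["PERSON-NAME", "PERSON-NAME", "ORGANIZATION-NAME", "FACILITY-PRONOUN"]
perfect = ace_like_value(reference, [])
empty_output = ace_like_value(reference, [("miss", cat) for cat in reference])
over_generating = ace_like_value(reference, [("false_alarm", "PERSON-NAME")] * 5)
print(perfect, empty_output, over_generating)
```

Under these placeholder weights, a perfect system scores 100, the empty output scores 0, and a system that proposes five spurious PERSON-NAME entities falls below 0, which mirrors the three cases mentioned in the text.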

In addition, for the mention detection task, we will also
present results by using the more established F-measure,
computed as the harmonic mean of precision and recall.
This measure gives equal importance to all entities,
regardless of their type, level or genericity.
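
For reference, the harmonic mean mentioned above can be computed from raw counts as in the short sketch below; the function name and the example counts are illustrative assumptions, not figures from the experiments.

```python
def f_measure(num_correct: int, num_system: int, num_reference: int) -> float:
    """Harmonic mean of precision and recall, from mention counts."""
    if num_correct == 0:
        return 0.0                              # avoids division by zero
    precision = num_correct / num_system        # correct / proposed by the system
    recall = num_correct / num_reference        # correct / present in the reference
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 correct mentions out of 80 proposed, 100 in the reference.
example_f = f_measure(60, 80, 100)
print(round(100 * example_f, 1))
```

Every correct or incorrect mention contributes equally to these counts, which is why the measure is insensitive to entity type, level and genericity.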

4.2  EDT  Results 

As described in Section 2.6, the mention detection systems
make use of a large set of features. To better assess
the contribution of the different types of features to the final
performance, we have grouped them into 4 categories:

1. Surface features: lexical features that can be derived
from investigating the words: words, morphs,
prefix/suffix, capitalization/word-form flags

2. Features derived from processing the data with NLP
techniques: POS tags, text chunks, word segmentation,
etc.

3. Gazetteer/dictionary features

4. Features obtained by running other named-entity
classifiers (with different tag sets): HMM, MaxEnt
and RRM output on the 32-category, 49-category
and MUC data sets.*

* In the English MaxEnt system, which uses 295k features,
the distribution among the four classes of features is: 1: 72%,
2: 24%, 3: 1%, 4: 3%.

Feature Sets        Arabic                Chinese
                F-measure    ACE      F-measure    ACE
1                 59.7       43.1       62.6       51.1
1+2               60.8       46.0       67.1       57.7
1+2+3             63.4       51.8       68.4       67.7
1+2+3+4           68.5       53.2       68.6       74.1

Table 3: Mention detection results for Arabic and
Chinese

              Arabic    Chinese    English
                                   Feb02    Sept02
ACE value      83.2      89.4      90.9      88.0

Table 4: Entity tracking results on true mentions

Table 3 presents the mention detection comparative results,
F-measure and ACE value, on Arabic and Chinese
data. The Arabic and Chinese models were built using
the RRM model. There are some interesting observations:
first, improvements in F-measure do not correlate
well with improvements in ACE value; small improvements
in F-measure are sometimes paired with large relative
improvements in ACE value, a fact due to the different
weighting of entity types. Second, the largest single
improvement in ACE value is obtained by adding dictionary
features, at least in this order of adding features.
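
The two observations about Table 3 can be checked with a few lines of arithmetic. The sketch below copies the Table 3 figures and computes the gain obtained at each step of adding a feature class; the variable and function names are assumptions made for this illustration.

```python
# (F-measure, ACE value) per cumulative feature set, copied from Table 3.
TABLE3 = {
    "Arabic":  [(59.7, 43.1), (60.8, 46.0), (63.4, 51.8), (68.5, 53.2)],
    "Chinese": [(62.6, 51.1), (67.1, 57.7), (68.4, 67.7), (68.6, 74.1)],
}
# Feature class added at each step (numbering follows the list of categories).
ADDED = ["2 (NLP-derived)", "3 (gazetteer/dictionary)", "4 (classifier output)"]


def stepwise_gains(rows):
    """Gain in (F-measure, ACE value) between consecutive feature sets."""
    return [
        (round(f1 - f0, 1), round(a1 - a0, 1))
        for (f0, a0), (f1, a1) in zip(rows, rows[1:])
    ]


gains = {lang: stepwise_gains(rows) for lang, rows in TABLE3.items()}
for lang, steps in gains.items():
    for name, (delta_f, delta_ace) in zip(ADDED, steps):
        print(f"{lang:8s} +{name:26s} F: {delta_f:+.1f}  ACE: {delta_ace:+.1f}")
```

For both languages the largest ACE gain comes from adding class 3, while the F-measure gain at that step is modest; for Chinese, adding class 4 changes the F-measure by only 0.2 but the ACE value by 6.4.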

For  English,  we  investigated  in  more  detail  the  way 
features  interact.  Figure  1  presents  a  hierarchical  direct 
comparison  between  the  performance  of  the  RRM  model 
and  the  MaxEnt  model.  We  can  observe  that  the  RRM 
model  makes  better  use  of  gazetteers,  and  manages  to 
close  the  initial  performance  gap  to  the  MaxEnt  model. 

Table 4 presents the results obtained by running the entity
tracking algorithm on true mentions. It is interesting
to compare the entity tracking results with inter-annotator
agreements. LDC reported (NIST, 2003b) that the
inter-annotator agreement (computed as ACE values) is
73.5%, 87.3% and 87.8% for Arabic, Chinese
and English, respectively. The system performance
is very close to human performance on this task; this
small difference in performance highlights the difficulty
of the entity tracking task.

Finally, Table 5 presents the results obtained by running
mention detection followed by entity tracking
on the ACE’03 evaluation data. Our submission in the
evaluation performed well relative to the other participating
systems (contractual obligations prevent us from
elaborating further).

              Arabic    Chinese    English
                                   RRM      MaxEnt
ACE value      54.5      58.8      69.7      73.4

Table 5: ACE value results for the three languages on
ACE’03 evaluation data.

Figure 1: Performance of the English mention detection
system on different sets of features (uniformly penalized
F-measure), September’02 data. The lower part of each
box describes the particular combination of feature types;
the arrows show an inclusion relationship between the
feature sets.

4.3  Discussion

The same basic model was used to perform EDT in three
languages. Our approach is language-independent, in that
the fundamental classification algorithm can be applied to
every language and the only changes involve finding
appropriate and available feature streams for each language.
The entity tracking system uses even fewer language-specific
features than the mention detection systems.

One limitation apparent in our mention detection system
is that it does not explicitly model the genericity of
a mention. Deciding whether a mention refers to a specific
entity or a generic entity requires knowledge of
substantially wider context than the window of 5 tokens we
currently use in our mention detection systems. One way
we plan to improve performance for such cases is to separate
the task into two parts: one in which the mention
type and level are predicted, followed by a genericity-predicting
model which uses long-range features, such as
sentence or document level features.

Our entity tracking system currently cannot resolve the
coreference of pronouns very accurately. Although this is
weighted lightly in ACE evaluation, good anaphora resolution
can be very useful in many applications and we
will continue exploring this task in the future.

The Arabic and Chinese EDT tasks were included in
the ACE evaluation for the first time in 2003. Unlike
the English case, the systems had access to only a small
amount of training data (60k words for Arabic and 90k
characters for Chinese, in contrast with 340k words for
English), which made it difficult to train statistical models
with a large number of feature types. Future ACE evaluations
will shed light on whether this lower performance,
shown in Table 3, is due to lack of training data or to
language-specific ambiguity.

The final observation we want to make is that the systems
were not directly optimized for the ACE value, and
there is no obvious way to do so. As Table 3 shows, the
F-measure and ACE value do not correlate well: systems
trained to optimize the former might not end up optimizing
the latter. It is an open research question whether a
system can be directly optimized for the ACE value.

5  Conclusion 

This paper presents a language-independent framework
for the entity detection and tracking task, which is shown
to obtain top-tier performance on three radically different
languages: Arabic, Chinese and English. The task is
separated into two sub-tasks: a mention detection part,
which is modeled through a named entity-like approach,
and an entity tracking part, for which a novel modeling
approach is proposed.

This statistical framework is general and can incorporate
heterogeneous feature types: the models were
built using a wide array of lexical, syntactic and semantic
features extracted from texts, and further enhanced
by adding the output of pre-existing semantic classifiers
as feature streams; additional feature types help improve
the performance significantly, especially in terms of ACE
value. The experimental results show that the systems
perform remarkably well, both for well-investigated languages,
such as English, and for the relatively new additions
Arabic and Chinese.